Latent diffusion model

https://gyazo.com/12ee3e45c3abea0038e6cec937d0cf4c

Proposing paper: 2112.10752 High-Resolution Image Synthesis with Latent Diffusion Models

Downloading the paper takes a long time 基素.icon

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.

Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining.

However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.

To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.

Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at this https URL.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

2021

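Generating images from text with Latent Diffusion AI - TeDokology

Conventional diffusion models train by adding noise to the training data step by step until all information is lost and only pure noise remains, then learning to trace that process in reverse.

These diffusion models deliver state-of-the-art performance on image generation tasks, but training and inference could require enormous GPU resources.

Latent diffusion models introduce cross-attention layers and achieve performance comparable to earlier models while greatly reducing the amount of computation.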
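
As a rough illustration of what a cross-attention layer does, here is a minimal NumPy sketch: queries are computed from the image latents, keys and values from the conditioning input (for example text-token embeddings). The function names, the toy sizes, and the random weights are all assumptions made for this note; the sketch is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Queries come from the image latents, keys/values from the condition."""
    q = latent_tokens @ w_q          # (n_latent, d)
    k = cond_tokens @ w_k            # (n_cond, d)
    v = cond_tokens @ w_v            # (n_cond, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_latent, n_cond)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_latent, n_cond, d_latent, d_cond, d = 16, 5, 8, 12, 4   # toy sizes (assumed)
latent_tokens = rng.normal(size=(n_latent, d_latent))     # flattened 4x4 latent grid
cond_tokens = rng.normal(size=(n_cond, d_cond))           # e.g. 5 text-token embeddings
w_q = rng.normal(size=(d_latent, d))                      # random stand-ins for learned weights
w_k = rng.normal(size=(d_cond, d))
w_v = rng.normal(size=(d_cond, d))

out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each latent position ends up with a weighted mix of the conditioning tokens, which is how the condition can steer different image regions differently.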


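The frontier of image-generation AI! An explanation of the latest research on diffusion models and image-generation models

Challenges of diffusion models

The amount of computation becomes enormous

Principle

Earlier diffusion models treated an image directly as an array of pixels and computed the addition and removal of noise pixel by pixel. https://ja.stateofaiguides.com/20221012-stable-diffusion/

They generate an image from noise directly in the RGB pixel space of the image, by gradually removing noise from it.

The neural network has to be trained in this high-dimensional space,

and many inference passes also have to be run at generation time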
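
The pixel-space noising described above can be sketched as follows. The step count `T`, the linear schedule in `betas`, and the toy image size are assumptions (common defaults for this family of models), not values from the cited article.

```python
import numpy as np

T = 1000                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - betas)        # fraction of signal variance kept after t steps

def add_noise(x0, t, rng):
    """Sample x_t from x_0 in one shot: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 64, 64))   # toy RGB image scaled to [-1, 1]

for t in (0, 250, 999):
    xt = add_noise(x0, t, rng)
    print(t, round(float(alpha_bar[t]), 4), xt.shape)
```

Under these assumptions almost no signal is left at the last step. In the plain formulation, undoing the process takes one network evaluation per step, each over every pixel value, which is where the cost noted above comes from.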

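Most of the information capacity is wasted on representing image details (high-frequency components) that humans can hardly perceive

Apparently this is what happens when you try to represent images with a likelihood-based generative model

The abstraction that it only needs to be meaningful to humans resembles the Bayer array 基素.icon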

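This paper proposes latent diffusion models (LDM), which train and run inference in a latent space that is perceptually almost identical to the input image yet computationally cheaper. In a word, LDM is "the diffusion-model version of VQGAN (2020)"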
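
To get a feel for how much smaller the latent space can be, here is a small arithmetic sketch. The image size, the downsampling factor of 8, and the 4 latent channels are assumptions picked for illustration; they are not stated in the notes above.

```python
# Assumed sizes for illustration only (not stated in the notes above).
height, width, channels = 512, 512, 3       # RGB image
factor, latent_channels = 8, 4              # spatial downsampling factor and latent channels

pixel_values = height * width * channels
latent_values = (height // factor) * (width // factor) * latent_channels

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values // latent_values)  # 48
```

With these numbers every denoising step works on 48 times fewer values than it would in pixel space.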

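Using a mechanism similar to VQGAN (Esser et al., 2020), the input is converted into a lower-dimensional, more semantic latent representation z

Latent representation: a vector that captures the rough features of an image https://ja.stateofaiguides.com/20221012-stable-diffusion/

A diffusion process that generates latent representations from noise is then learned

The diffusion model is applied inside the latent space https://ja.stateofaiguides.com/20221012-stable-diffusion/

The generated latent representation is converted into an image using a decoder

In other words, instead of generating an image from noise, the model first generates a latent representation from noise and then converts it back into an image, so the image is produced through a two-stage process.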
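
The two-stage process can be sketched roughly as below. `predict_noise` and `decode` are hypothetical stand-ins for the trained U-Net and the pretrained decoder, and the step count, noise schedule, and shapes are assumptions; the loop is a standard DDPM-style ancestral sampler, used here only to show where each stage happens.

```python
import numpy as np

T = 50                                     # sampling steps (assumed, kept small)
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(z, t):
    """Hypothetical stand-in for the trained U-Net; a real model is learned."""
    return 0.1 * z

def decode(z, factor=8):
    """Hypothetical stand-in for the pretrained decoder: latent -> RGB image."""
    rgb = z[:3]                                        # keep 3 of the latent channels
    return np.repeat(np.repeat(rgb, factor, axis=1), factor, axis=2)

def generate(rng, latent_shape=(4, 8, 8)):
    # Stage 1: start from pure noise and denoise step by step in latent space.
    z = rng.normal(size=latent_shape)
    for t in reversed(range(T)):
        eps = predict_noise(z, t)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=latent_shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    # Stage 2: map the finished latent back to pixel space once.
    return z, decode(z)

z, image = generate(np.random.default_rng(0))
print(z.shape, image.shape)
```

The iterative loop only ever touches the small latent array; the decoder runs once at the end.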

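VQGAN uses a transformer to model the distribution of latent representations, but whereas a Transformer in principle handles only one-dimensional sequences, a CNN-based U-Net can capture two-dimensional structure, which reportedly allows very high-quality reconstruction even after compression into the latent space.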
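
A tiny sketch of the one-dimensional versus two-dimensional point: when a latent grid is flattened row by row into a sequence, horizontally adjacent cells stay adjacent but vertically adjacent cells end up a full row apart. The grid size and the helper `seq_index` are assumptions for illustration; this shows only the ordering issue, not how either model is actually implemented.

```python
height, width = 4, 4   # toy latent grid size (assumed)

def seq_index(row, col):
    """Position of grid cell (row, col) after row-major flattening."""
    return row * width + col

center = (1, 1)
right = (1, 2)
below = (2, 1)

print(seq_index(*right) - seq_index(*center))  # 1
print(seq_index(*below) - seq_index(*center))  # 4
```

A convolution sees both neighbours at the same distance, whereas a sequence model has to learn that positions `width` apart are spatially adjacent.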

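A paper-reading guide for people who want to understand Stable Diffusion from the basics [free article]

Stable Diffusion is also distinctive in that it is trained on a dataset called LAION-Aesthetics, which collects only "beautiful" images.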